{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install mcp"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [],
   "source": [
    "faq_text = \"\"\"Question 1: What is the first step before building a machine learning model?\n",
    "Answer 1: Understand the problem, define the objective, and identify the right metrics for evaluation.\n",
    "\n",
    "Question 2: How important is data cleaning in ML?\n",
    "Answer 2: Extremely important. Clean data improves model performance and reduces the chance of misleading results.\n",
    "\n",
    "Question 3: Should I normalize or standardize my data?\n",
    "Answer 3: Yes, especially for models sensitive to feature scales like SVMs, KNN, and neural networks.\n",
    "\n",
    "Question 4: When should I use feature engineering?\n",
    "Answer 4: Always consider it. Well-crafted features often yield better results than complex models.\n",
    "\n",
    "Question 5: How to handle missing values?\n",
    "Answer 5: Use imputation techniques like mean/median imputation, or model-based imputation depending on the context.\n",
    "\n",
    "Question 6: Should I balance my dataset for classification tasks?\n",
    "Answer 6: Yes, especially if the classes are imbalanced. Techniques include resampling, SMOTE, and class-weighting.\n",
    "\n",
    "Question 7: How do I select features for my model?\n",
    "Answer 7: Use domain knowledge, correlation analysis, or techniques like Recursive Feature Elimination or SHAP values.\n",
    "\n",
    "Question 8: Is it good to use all features available?\n",
    "Answer 8: Not always. Irrelevant or redundant features can reduce performance and increase overfitting.\n",
    "\n",
    "Question 9: How do I avoid overfitting?\n",
    "Answer 9: Use techniques like cross-validation, regularization, pruning (for trees), and dropout (for neural nets).\n",
    "\n",
    "Question 10: Why is cross-validation important?\n",
    "Answer 10: It provides a more reliable estimate of model performance by reducing bias from a single train-test split.\n",
    "\n",
    "Question 11: What’s a good train-test split ratio?\n",
    "Answer 11: Common ratios are 80/20 or 70/30, but use cross-validation for more robust evaluation.\n",
    "\n",
    "Question 12: Should I tune hyperparameters?\n",
    "Answer 12: Yes. Use grid search, random search, or Bayesian optimization to improve model performance.\n",
    "\n",
    "Question 13: What’s the difference between training and validation sets?\n",
    "Answer 13: Training set trains the model, validation set tunes hyperparameters, and test set evaluates final performance.\n",
    "\n",
    "Question 14: How do I know if my model is underfitting?\n",
    "Answer 14: It performs poorly on both training and test sets, indicating it hasn’t learned patterns well.\n",
    "\n",
    "Question 15: What are signs of overfitting?\n",
    "Answer 15: High accuracy on training data but poor generalization to test or validation data.\n",
    "\n",
    "Question 16: Is ensemble modeling useful?\n",
    "Answer 16: Yes. Ensembles like Random Forests or Gradient Boosting often outperform individual models.\n",
    "\n",
    "Question 17: When should I use deep learning?\n",
    "Answer 17: Use it when you have large datasets, complex patterns, or tasks like image and text processing.\n",
    "\n",
    "Question 18: What is data leakage and how to avoid it?\n",
    "Answer 18: Data leakage is using future or target-related information during training. Avoid by carefully splitting and preprocessing.\n",
    "\n",
    "Question 19: How do I measure model performance?\n",
    "Answer 19: Choose appropriate metrics: accuracy, precision, recall, F1, ROC-AUC for classification; RMSE, MAE for regression.\n",
    "\n",
    "Question 20: Why is model interpretability important?\n",
    "Answer 20: It builds trust, helps debug, and ensures compliance—especially important in high-stakes domains like healthcare.\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['Question 1: What is the first step before building a machine learning model? Answer 1: Understand the problem, define the objective, and identify the right metrics for evaluation.',\n",
       " 'Question 2: How important is data cleaning in ML? Answer 2: Extremely important. Clean data improves model performance and reduces the chance of misleading results.',\n",
       " 'Question 3: Should I normalize or standardize my data? Answer 3: Yes, especially for models sensitive to feature scales like SVMs, KNN, and neural networks.',\n",
       " 'Question 4: When should I use feature engineering? Answer 4: Always consider it. Well-crafted features often yield better results than complex models.']"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "new_faq_text = [i.replace(\"\\n\", \" \") for i in faq_text.split(\"\\n\\n\")]\n",
    "new_faq_text[:4]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.embeddings.huggingface import HuggingFaceEmbedding\n",
    "from tqdm import tqdm\n",
    "\n",
    "def batch_iterate(lst, batch_size):\n",
    "    for i in range(0, len(lst), batch_size):\n",
    "        yield lst[i : i + batch_size]\n",
    "\n",
    "class EmbedData:\n",
    "\n",
    "    def __init__(self, \n",
    "                 embed_model_name=\"nomic-ai/nomic-embed-text-v1.5\",\n",
    "                 batch_size=32):\n",
    "        \n",
    "        self.embed_model_name = embed_model_name\n",
    "        self.embed_model = self._load_embed_model()\n",
    "        self.batch_size = batch_size\n",
    "        self.embeddings = []\n",
    "\n",
    "    def _load_embed_model(self):\n",
    "        embed_model = HuggingFaceEmbedding(model_name=self.embed_model_name,\n",
    "                                           trust_remote_code=True,\n",
    "                                           cache_folder='./hf_cache')\n",
    "        return embed_model\n",
    "    \n",
    "    def generate_embedding(self, context):\n",
    "        return self.embed_model.get_text_embedding_batch(context)\n",
    "    \n",
    "    def embed(self, contexts):\n",
    "        self.contexts = contexts\n",
    "        \n",
    "        for batch_context in tqdm(batch_iterate(contexts, self.batch_size),\n",
    "                                  total=len(contexts)//self.batch_size,\n",
    "                                  desc=\"Embedding data in batches\"):\n",
    "                                  \n",
    "            batch_embeddings = self.generate_embedding(batch_context)\n",
    "            \n",
    "            self.embeddings.extend(batch_embeddings)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "<All keys matched successfully>\n",
      "Embedding data in batches: 1it [00:00,  3.42it/s]\n"
     ]
    }
   ],
   "source": [
    "batch_size = 32\n",
    "\n",
    "embeddata = EmbedData(batch_size=batch_size)\n",
    "\n",
    "embeddata.embed(new_faq_text)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [],
   "source": [
    "from qdrant_client import models\n",
    "from qdrant_client import QdrantClient\n",
    "\n",
    "class QdrantVDB:\n",
    "\n",
    "    def __init__(self, collection_name, vector_dim=768, batch_size=512):\n",
    "        self.collection_name = collection_name\n",
    "        self.batch_size = batch_size\n",
    "        self.vector_dim = vector_dim\n",
    "\n",
    "    def define_client(self):\n",
    "        self.client = QdrantClient(url=\"http://localhost:6333\",\n",
    "                                   prefer_grpc=True)\n",
    "        \n",
    "    def create_collection(self):\n",
    "        \n",
    "        if not self.client.collection_exists(collection_name=self.collection_name):\n",
    "\n",
    "            self.client.create_collection(collection_name=self.collection_name,\n",
    "                                          \n",
    "                                          vectors_config=models.VectorParams(\n",
    "                                                              size=self.vector_dim,\n",
    "                                                              distance=models.Distance.DOT,\n",
    "                                                              on_disk=True),\n",
    "                                          \n",
    "                                          optimizers_config=models.OptimizersConfigDiff(\n",
    "                                                                            default_segment_number=5,\n",
    "                                                                            indexing_threshold=0)\n",
    "                                         )\n",
    "            \n",
    "    def ingest_data(self, embeddata):\n",
    "        \n",
    "        for batch_context, batch_embeddings in tqdm(zip(batch_iterate(embeddata.contexts, self.batch_size), \n",
    "                                                        batch_iterate(embeddata.embeddings, self.batch_size)), \n",
    "                                                    total=len(embeddata.contexts)//self.batch_size, \n",
    "                                                    desc=\"Ingesting in batches\"):\n",
    "        \n",
    "            self.client.upload_collection(collection_name=self.collection_name,\n",
    "                                        vectors=batch_embeddings,\n",
    "                                        payload=[{\"context\": context} for context in batch_context])\n",
    "\n",
    "        self.client.update_collection(collection_name=self.collection_name,\n",
    "                                    optimizer_config=models.OptimizersConfigDiff(indexing_threshold=20000)\n",
    "                                    )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Ingesting in batches: 1it [00:00, 112.12it/s]\n"
     ]
    }
   ],
   "source": [
    "database = QdrantVDB(\"ml_faq_collection\")\n",
    "database.define_client()\n",
    "database.create_collection()\n",
    "database.ingest_data(embeddata)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [],
   "source": [
    "class Retriever:\n",
    "\n",
    "    def __init__(self, vector_db, embeddata):\n",
    "        \n",
    "        self.vector_db = vector_db\n",
    "        self.embeddata = embeddata\n",
    "\n",
    "    def search(self, query):\n",
    "        query_embedding = self.embeddata.embed_model.get_query_embedding(query)\n",
    "\n",
    "        # select the top 3 results\n",
    "        result = self.vector_db.client.search(\n",
    "            collection_name=self.vector_db.collection_name,\n",
    "            \n",
    "            query_vector=query_embedding,\n",
    "            \n",
    "            search_params=models.SearchParams(\n",
    "                quantization=models.QuantizationSearchParams(\n",
    "                    ignore=True,\n",
    "                    rescore=True,\n",
    "                    oversampling=2.0,\n",
    "                )\n",
    "            ),\n",
    "            limit=3,\n",
    "            timeout=1000,\n",
    "        )\n",
    "\n",
    "        context = [dict(data) for data in result]\n",
    "        combined_prompt = []\n",
    "\n",
    "        for entry in context[:3]:\n",
    "            context = entry[\"payload\"][\"context\"]\n",
    "\n",
    "            combined_prompt.append(context)\n",
    "\n",
    "        final_output = \"\\n\\n---\\n\\n\".join(combined_prompt)\n",
    "        return final_output"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [],
   "source": [
    "result = Retriever(database, embeddata).search(\"How to prevent overfitting?\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "base",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
